import math
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
In this report the data collected by the Twindle is explored, following the steps from "A Comprehensive Guide to Data Exploration" by Sunil Ray of Analytics Vidhya.
# Load datasets
postillion_df = pd.read_csv('./data/postillion_joined.csv', parse_dates=['TimeStamp'])
big_top_df = pd.read_csv('./data/big_top_joined.csv', parse_dates=['TimeStamp'])
Each variable can either be continuous or categorical. Both need to be analyzed in separate ways. In this chapter we are going to define these types for each variable.
postillion_df.dtypes
TVOC                        float64
Pressure                    float64
CO2                         float64
Illumination                float64
Activity                    float64
TimeStamp            datetime64[ns]
TempInt                     float64
Humidity                    float64
DOOR_OPEN_STATUS            float64
DOOR_OPEN_TIMES             float64
room                         object
dtype: object
big_top_df.dtypes
TVOC                        float64
Pressure                    float64
CO2                         float64
Illumination                float64
Activity                    float64
TimeStamp            datetime64[ns]
TempInt                     float64
Humidity                    float64
DOOR_OPEN_STATUS            float64
DOOR_OPEN_TIMES             float64
room                         object
dtype: object
The datatypes seem to be loaded mostly correctly. The exception is 'DOOR_OPEN_STATUS': it should be transformed into a categorical value, either open or closed, instead of a continuous variable.
postillion_df['DOOR_OPEN_STATUS'] = postillion_df['DOOR_OPEN_STATUS'].map({0:False, 1:True})
big_top_df['DOOR_OPEN_STATUS'] = big_top_df['DOOR_OPEN_STATUS'].map({0:False, 1:True})
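Note that `.map` leaves any value outside the mapping as NaN, so missing door readings stay missing rather than being coerced to a status. A quick check on a toy series:

```python
import numpy as np
import pandas as pd

# Toy series mimicking the raw DOOR_OPEN_STATUS column: 0/1 floats with a gap.
raw = pd.Series([0.0, 1.0, np.nan, 1.0])
status = raw.map({0: False, 1: True})

print(status.tolist())      # [False, True, nan, True]
print(status.isna().sum())  # 1
```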
def iqr_outlier_removal(df, scale=1.5):
    """Removes outliers based on the interquartile range (IQR) method."""
    # Split off non-numeric data; quantiles are only defined for numeric columns
    non_numeric = df.select_dtypes(exclude=np.number).columns.values
    df_in = df.select_dtypes(np.number)
    # Calculate the 1st and 3rd quartiles and the range between them (IQR)
    q1 = df_in.quantile(0.25)
    q3 = df_in.quantile(0.75)
    iqr = q3 - q1
    # Calculate the lower and upper boundaries
    lower = q1 - scale * iqr
    upper = q3 + scale * iqr
    # Keep only rows that fall within the boundaries for every numeric column
    df_out = df_in[~((df_in < lower) | (df_in > upper)).any(axis=1)]
    print(f'Removed {df_in.shape[0] - df_out.shape[0]} outliers.')
    # Rejoin the non-numeric data and return the resulting dataframe
    return df_out.join(df[non_numeric])
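The fences the function computes can be verified by hand on a toy series with the default `scale=1.5`:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 100])  # 100 is an obvious outlier
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1           # 4 - 2 = 2
lower = q1 - 1.5 * iqr  # 2 - 3 = -1
upper = q3 + 1.5 * iqr  # 4 + 3 = 7

print(lower, upper)                             # -1.0 7.0
print(s[(s >= lower) & (s <= upper)].tolist())  # [1, 2, 3, 4]
```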
# Up the scale to account for the skewness of the data.
postillion_df = iqr_outlier_removal(postillion_df, scale=2)
big_top_df = iqr_outlier_removal(big_top_df, scale=2)
Removed 9994 outliers.
Removed 2899 outliers.
Univariate analysis is the simplest form of analyzing data: 'uni' means one, so the data has only one variable. Each variable is analyzed separately. This is done by visualizing the data in histograms and boxplots to detect any anomalies or outliers.
def visualize_room(dataset, columns, suptitle, xlabel, ylabel, n_cols=6):
    n_rows = math.ceil(len(columns) * 2 / n_cols)
    # Define plot
    fig, axes = plt.subplots(n_rows, n_cols)
    fig.subplots_adjust(hspace=0.5, wspace=0.35)
    fig.set_size_inches(n_cols * 4, n_rows * 4)
    fig.suptitle(suptitle)
    # Use an iterator to visualize each column twice: one histogram, one boxplot.
    ax_iter = iter(axes.flatten())
    # Create a plot for each column
    for column in columns:
        data = dataset[column].dropna()
        ax = next(ax_iter)
        ax.set_title(f'{column} histogram')
        ax.set_xlabel(xlabel)
        ax.set_ylabel(ylabel)
        ax.hist(data)
        ax = next(ax_iter)
        ax.set_title(f'{column} boxplot')
        ax.boxplot(data, showfliers=True, flierprops={'alpha': 0.1, 'marker': '.'})
    # Return the axes to be able to save/alter them later
    return axes
subset = postillion_df.select_dtypes(np.number).copy()
subset['room'] = postillion_df['room']
rooms = subset['room'].unique()
for room in rooms:
    room_data = subset[subset['room'] == room]
    room_data = room_data.dropna(axis=1, how='all')
    columns = room_data.select_dtypes(np.number).columns.values
    p_axes = visualize_room(room_data, columns, f'Distribution of Numeric Values of Postillion: {room}', 'Value', 'Count')
subset = big_top_df.select_dtypes(np.number).copy()
subset['room'] = big_top_df['room']
rooms = subset['room'].unique()
for room in rooms:
    room_data = subset[subset['room'] == room]
    room_data = room_data.dropna(axis=1, how='all')
    # Do not visualize if there are fewer than 10 total datapoints for the current room
    if len(room_data) > 10:
        columns = room_data.select_dtypes(np.number).columns.values
        bt_axes = visualize_room(room_data, columns, f'Distribution of Numeric Values of Big Top: {room}', 'Value', 'Count')
def visualize_door_status(dataframe, rooms):
    fig, axes = plt.subplots(1, len(rooms))
    fig.set_size_inches(len(rooms) * 4, len(rooms))
    fig.suptitle('Counts of "DOOR_OPEN_STATUS".')
    for index, room in enumerate(rooms):
        data = dataframe[dataframe['room'] == room]['DOOR_OPEN_STATUS'].value_counts()
        if len(data) > 0:
            ax = axes[index]
            data.plot(kind='bar', ax=ax)
            ax.set_title(room)
    plt.tight_layout()
# Visualize per location per room the door open status count
subset = postillion_df[['room', 'DOOR_OPEN_STATUS']]
rooms = subset['room'].unique()
visualize_door_status(subset, rooms)
The data contains large outliers: the IQR outlier removal method drops about 20% of the data. This might indicate that faulty measurements are, or were, being made, or that there are indeed many anomalies.
There is also a difference in which data is recorded for each room. The data reading pipeline results in a dataset that has data for the most important columns (all air quality and temperature columns). This means that only data for rooms a, b and c of the Postillion hotel and the boardroom of the Big Top are available.
Before dropping the outliers the data was heavily skewed. This might cause inaccuracies in regression models, so normalization techniques might need to be applied.
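One such technique is a log transform, sketched here on synthetic lognormal data (the random values are illustrative only, not Twindle measurements); `np.log1p` is used because these sensor columns can contain zeros:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed readings, stand-ins for e.g. TVOC or CO2.
skewed = pd.Series(rng.lognormal(mean=5, sigma=1, size=1000))

print(round(skewed.skew(), 2))            # strongly right-skewed
print(round(np.log1p(skewed).skew(), 2))  # close to symmetric after the transform
```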
During bivariate analysis the relation between two variables is explored. This is done by creating correlation coefficient heatmaps, visualizing the most correlated features in scatterplots, and examining the change in values over time.
def draw_correlation_heatmap(data, ax):
    # Extract all numeric data for the current room from the subset.
    data = data.select_dtypes(np.number)
    # Calculate correlation coefficients.
    corr = data.corr()
    # Visualize
    sns.heatmap(corr, ax=ax, annot=True, square=True, vmin=-1, vmax=1, cmap='coolwarm', cbar_kws={'shrink': .65})
    return (ax, corr)
# Loop through rooms, calculate correlation coefficients and visualize in heatmaps
subset = postillion_df.select_dtypes(np.number).copy()
subset['room'] = postillion_df['room']
rooms = subset['room'].unique()
fig, axes = plt.subplots(1, len(rooms))
fig.set_size_inches(len(rooms) * 6.66, 5)
fig.subplots_adjust(hspace=0.5, wspace=0.35)
fig.suptitle('Correlation Coefficient Heatmaps for Postillion')
corrs = []
for ax, room in zip(axes.flatten(), rooms):
    data = subset[subset['room'] == room]
    data = data.drop(['DOOR_OPEN_TIMES', 'Activity'], axis=1)
    data = data.dropna(axis=1, how='all')
    ax, corr = draw_correlation_heatmap(data, ax)
    # Save the room and correlation matrix for later use in the scatter plots.
    corrs.append((room, corr))
    ax.set_title(room)
def scatter(dataframe, room, correlations, threshold=0.15):
    # Square the correlation matrix so positive and negative correlations rank equally
    correlations = correlations ** 2
    # For each feature collect the most correlated others, if any.
    for feature, row in correlations.iterrows():
        # Remove the reference to the feature itself within the row
        row = row[row.index != feature]
        # Only keep the values that are above the threshold.
        row = row[row.values > threshold]
        # The feature may not be strongly correlated to any other; in that case it cannot be visualized.
        if len(row) > 0:
            # Create a figure for the current feature and a subplot for each feature it is compared to.
            fig, axes = plt.subplots(len(row), 1, squeeze=False)
            fig.set_size_inches(4, len(row) * 4)
            fig.suptitle(f'{room}:{feature}')
            fig.subplots_adjust(hspace=0.5, wspace=0.35)
            for ax, value, corr_value in zip(axes.flatten(), row.index, row.values):
                # Collect data and log-transform.
                x = np.log(dataframe[feature])
                y = np.log(dataframe[value])
                # Visualize in a scatter plot.
                ax.scatter(x, y, alpha=0.1)
                ax.set_title(f'{value} with a {round(math.sqrt(corr_value), 2)} correlation score.')
                ax.set_xlabel(f'{feature} (log)')
                ax.set_ylabel(f'{value} (log)')
# Scatterplots are sensitive to outliers, so extra outlier removal is applied first.
df = iqr_outlier_removal(postillion_df)
for room, correlations in corrs:
    scatter(df, room, correlations)
Removed 4446 outliers.
RuntimeWarning: divide by zero encountered in log
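The RuntimeWarning comes from `np.log` hitting zero readings. A defensive variant, sketched below (the helper name is illustrative), masks out non-positive pairs before transforming; an alternative would be `np.log1p`, which shifts the values instead of dropping them:

```python
import numpy as np
import pandas as pd

def log_pair(x, y):
    """Return log-transformed copies of x and y, keeping only rows
    where both are strictly positive so np.log is well defined."""
    mask = (x > 0) & (y > 0)
    return np.log(x[mask]), np.log(y[mask])

x = pd.Series([0.0, 1.0, np.e, 10.0])
y = pd.Series([2.0, 0.0, 1.0, 100.0])
lx, ly = log_pair(x, y)
print(len(lx))  # 2 rows survive: the two where both values are positive
```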
# Loop through rooms, calculate correlation coefficients and visualize in heatmaps
subset = big_top_df.select_dtypes(np.number).copy()
subset['room'] = big_top_df['room']
rooms = subset['room'].unique()
fig, axes = plt.subplots(1, len(rooms))
fig.set_size_inches(len(rooms) * 6.66, 5)
fig.subplots_adjust(hspace=0.5, wspace=0.35)
fig.suptitle('Correlation Coefficient Heatmaps for Big Top')
corrs = []
for ax, room in zip(axes.flatten(), rooms):
    data = subset[subset['room'] == room]
    data = data.drop(['DOOR_OPEN_TIMES', 'Activity'], axis=1)
    data = data.dropna(axis=1, how='all')
    ax, corr = draw_correlation_heatmap(data, ax)
    corrs.append((room, corr))
    ax.set_title(room)
UserWarning: Attempting to set identical bottom == top == 0 results in singular transformations; automatically expanding. (repeated for each empty heatmap)
df = iqr_outlier_removal(big_top_df)
for room, correlations in corrs:
    scatter(df, room, correlations)
Removed 1790 outliers.
RuntimeWarning: divide by zero encountered in log
Looking at the heatmaps and scatterplots, we can conclude that there are few strongly related features. The strength of the correlations also differs per room, indicating that there might be other factors that need to be considered.
The features of the dataset can be used to define the air quality of a room, but the dataset needs to be expanded with data on factors that cause changes in these features.
During the preparation phase the event-based data is converted to tabular data by joining the events on timestamps. When there are no events within range this results in NaN values, as can be seen below. During model development a pipeline step could be added to impute the missing data.
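A minimal sketch of such an imputation step, assuming time-ordered sensor data (the toy values and timestamps below are illustrative, not Twindle data):

```python
import numpy as np
import pandas as pd

# Toy time-indexed readings with a gap, mimicking a joined sensor column.
ts = pd.date_range('2022-01-01', periods=5, freq='h')
df = pd.DataFrame({'CO2': [400.0, np.nan, np.nan, 460.0, 480.0]}, index=ts)

# Linear interpolation in time fills the gap from the neighbouring readings.
df['CO2'] = df['CO2'].interpolate(method='time')
print(df['CO2'].tolist())
```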
def calculate_percentage_of_missing(df):
    return round(df.isnull().sum() * 100 / len(df), 2)
missing = calculate_percentage_of_missing(postillion_df)
ax = missing.plot(kind='bar', title='Missing Data for Postillion Dataset')
ax.set_xlabel('Column')
ax.set_ylabel('% of Data Missing')
missing = calculate_percentage_of_missing(big_top_df)
ax = missing.plot(kind='bar', title='Missing Data for Big Top Dataset')
ax.set_xlabel('Column')
ax.set_ylabel('% of Data Missing')
The 'DOOR_OPEN_TIMES' column contains the total number of times the door has been opened. Seeing that this count is not reset each day and that most of its values are missing, this column can probably be dropped. The rest of the missing data ranges between ~5% and 60%; when training a model this data needs to be either dropped or imputed.
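A sketch of a threshold-based column drop, with the helper name and the 60% cut-off chosen for illustration based on the observation above:

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, max_missing=0.6):
    """Drop columns whose fraction of missing values exceeds max_missing."""
    keep = df.columns[df.isnull().mean() <= max_missing]
    return df[keep]

toy = pd.DataFrame({
    'CO2': [400.0, 410.0, np.nan, 430.0],
    'DOOR_OPEN_TIMES': [np.nan, np.nan, np.nan, 12.0],  # 75% missing
})
print(drop_sparse_columns(toy).columns.tolist())  # ['CO2']
```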
For the outlier treatment, see the 'iqr_outlier_removal' function. The current method drops ~20% of the datapoints; if this is too much, the scale factor can be increased so that fewer datapoints are flagged as outliers.
This Exploratory Data Analysis was conducted to see how the data collected by the Twindle application is put together. The following observations have been made:

- The data contains large outliers; the IQR method removes about 20% of the datapoints.
- The data is heavily skewed, which may require normalization before modelling.
- Only rooms a, b and c of the Postillion and the boardroom of the Big Top have data for the most important columns.
- There are few strongly correlated features, and the correlation strength differs per room.
- Missing data ranges between ~5% and 60% per column; 'DOOR_OPEN_TIMES' can probably be dropped.
Based on these observations the data requirements can be researched. There needs to be a single feature that indicates air quality, based on temperature, humidity, etc. Data that signals changes in the current features needs to be added to the dataset. Together this can be used to develop a machine learning model that predicts the air quality.